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Abstract 

Most people in the world (-90%) are infected by the Epstein-Barr virus (EBV), which establishes itself permanently in B cells. Infection 
by EBV is related to a number of diseases including infectious mononucleosis, multiple sclerosis, and different types of cancer. So far, 
only seven complete EBV strains have been described, all of them coming from donors presenting EBV-related diseases. To perform a 
detailed comparative genomic analysis of EBV including, for the first time, EBV strains derived from healthy individuals, we recon- 
structed EBV sequences infecting lymphoblastoid cell lines (LCLs) from the 1000 Genomes Project. As strain B95-8 was used to 
transform B cells to obtain LCLs, it is always present, but a specific deletion in its genome sets it apart from natural EBV strains. After 
studying hundreds of individuals, we determined the presence of natural EBV in at least 1 0 of them and obtained a set of variants 
specific to wild-type EBV. By mapping the natural EBV reads into the EBV reference genome (NC007605), we constructed nearly 
complete wild-type viral genomes from three individuals. Adding them to the five disease-derived EBV genomic sequences available in 
the literature, we performed an in-depth comparative genomic analysis. We found that latency genes harbor more nucleotide 
diversity than lytic genes and that six out of nine latency-related genes, as well as other genes involved in viral attachment and 
entry into host cells, packaging, and the capsid, present the molecular signature of accelerated protein evolution rates, suggesting 
rapid host-parasite coevolution. 
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Introduction 

Human herpesvirus type 4, or Epstein-Barr virus (EBV), is a 
dsDNA virus with a genome size of around 1 70 kb. The EBV 
infects more than 90% of adults worldwide (Chang et al. 
2009). Usually, the primary infection occurs early in life and 
is asymptomatic in children, while in young adults can produce 
infectious mononucleosis (IM). EBV preferentially infects B cells 
but can also attach to and enter into epithelial cells (Borza and 
Hutt-Fletcher 2002). Upon infection, EBV establishes life-long 



latency in the form of episomes residing in the host cell's nu- 
cleus. Only a small subset of the -90 coding regions of the 
virus are expressed in latency: six nuclear proteins {EBNA-1 , -2, 
-3A, -3B, -3C, and -LP) and three membrane proteins (LMP-1 , 
-2A, and -2B) (Young and Rickinson 2004). 

Infection by EBV has been related to a number of complex 
diseases including multiple sclerosis (MS) and different types of 
cancer such as Burkitt's lymphoma (BL), Hodgkins' lymphoma 
(HL), or nasopharyngeal carcinoma (NPC) and gastric 
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carcinoma (GC). The prevalence of these diseases varies with 
geography. For instance, NPC is much more prevalent in Asia, 
whereas BL is found most commonly in equatorial Africa 
(Chang et al. 2009); another notorious example is given by 
MS, which has greater prevalence in higher latitudes in 
Caucasian populations, while it is extremely rare in the 
Tropics (Ascherio and Munger 2010). A number of studies 
have tried to relate genetic variability of EBV strains to the 
prevalence of these diseases or to broadly defined geograph- 
ical regions. Such efforts have focused on sequencing genes 
that play important roles in essential viral processes, such as 
BZLF1 (Gutierrez et al. 2002; Martini et al. 2007), EBNA-1 
(Habeshaw et al. 1999; Brennan et al. 2010; Lorenzetti 
et al. 2010; Wang, Liu, Xing et al. 2010), EBNA-2 (Aitken 
et al. 1994), EBNA-3A, -3B, and -3C (Gorzer et al. 2006), 
LMP-1 (Edwards et al. 1999, 2004), and LMP-2a (Wang, Liu, 
Jia et al. 2010). However, the few studies that have classified 
EBV strains using combinations of variants from more than 
one gene suggest that recombination does take place and 
that many variant combinations are possible, thus increasing 
the difficulty of typing EBV strains (Gutierrez et al. 2000; 
Chang et al. 2009; Sawada et al. 201 1). 

The study of the worldwide distribution of the genetic var- 
iability of EBV and its potential contributions to the risk of 
disease would benefit from the analysis of whole EBV ge- 
nomes. So far, only seven EBV strains have been completely 
sequenced: the B95-8/Raji strain from an American patient of 
IM; AG876 and MUTU from African individuals with BL; and 
the GD1, GD2, HKNPC1, and AKATA strains from Asian indi- 
viduals with NPC or BL (Baer et al. 1984; Zeng et al. 2005; 
Dolan et al. 2006; Liu et al. 201 1 ; Kwok et al. 201 2; Lin et al. 
2013) (supplementary table S1.1, Supplementary Material 
online). All these strains were derived from diseased 
individuals. 

We set out to perform a comprehensive comparative ge- 
nomic study of EBV. Rather than limiting ourselves to the pub- 
lished strains, we add new information by reconstructing EBV 
sequences extracted from lymphoblastoid cell lines (LCLs) be- 
longing to healthy individuals who were fully sequenced 
within the 1000 Genomes Project (1kG Project) (Durbin 
et al. 2010). To obtain a steady source of DNA for genetic 
studies, cell cultures of LCLs were established by transforming 
each donor's B cells with the B95-8 EBV strain. Thus, that 
strain is always present in all individual LCL cultures and may 
have eliminated all traces of endogenous EBV strains. 
However, B95-8 presents a specific 12-kb deletion in its 
genome that was spontaneously produced in the laboratory 
(Skare et al. 1 982), allowing us to distinguish between natural 
and artificial EBV strains. We were able to determine the pres- 
ence of natural EBV in at least ten individuals. A complete set 
of natural nucleotide variants was obtained from three of 
them. Adding these three new sequences to the already 
known complete EBV genomes, we were able to carry out 
analyses exploring genome diversity, phylogenetic 



relationships, and possible recombination events among EBV 
strains. Finally, we conducted an analysis of rates of protein 
evolution of viral genes. 

Materials and Methods 

Origin of Individual LCLs 

The samples sequenced within the 1000 Genomes Project 
came from LCLs established from blood at the nonprofit 
Coriell Institute for Medical Research. We used sequence 
reads from the 1kG release 20101123, with 1,103 available 
individual alignments to the reference human assembly. The 
analysis shown here is based on the 929 individuals who were 
sequenced using lllumina technology with paired-end se- 
quenced DNA libraries. We retrieved reads that do not map 
to the human genome and are labeled as "unmapped" for 
each individual in the 1kG Project database. 

Mapping 

Unmapped reads were mapped against the EBV reference 
genome (NC_007605), a composite of the B95-8 strain and 
1 2 kb from the Raji strain to correct the nonnatural B95-8- 
specific deletion from position 139,724 to 151,554. Previous 
to mapping, we masked the reference EBV genome using 
RepeatMasker (Smit et al. 1996-2010) to remove low-com- 
plexity regions. We also masked a large region of repeats be- 
tween positions 12,001 and 35,355. Mapping was performed 
using BWA (Li and Durbin 2010) with default parameters. 
After that, we removed duplicated read-pairs using 
SAMtools (Li et al. 2009). We kept only paired-reads with 
both reads mapping uniquely to the EBV genome. We deter- 
mined INDELs and realigned reads falling around them using 
GATK (DePristo et al. 201 1), and finally, we recalibrated base 
quality scores of the aligned reads by considering empirical 
Phred scores at the level of cycle effect (position in read) 
and dinucleotide mismatch distribution (the effect of the pre- 
ceding nucleotide). 

Variant Calling 

We called variants from pileup files produced by SAMtools 
based on cut-offs at the level of coverage (>10), phred- 
score (>20), and a minimum number of reads supporting a 
variant (>2). We filtered out those variants with extreme 
strand bias and an allele balance lower than 5%. INDELs 
were called with the same cut-offs using VarScan (Koboldt 
et al. 2009). Variants calling in the B95-8-specific deletion 
were refined using VarScan relaxing coverage cut-offs. Only 
variants present in the nonrepetitive region of EBV were used 
for subsequent analysis. In addition to the already masked 
region identified by RepeatMasker, we also ignored variants 
in specific repeats in the EBV genome reference genome 
(NC_007605) according to GenBank annotation. The total 



Genome Biol. Evol. 6(4):846-860. doi:10.1093/gbe/evu054 Advance Access publication March 28, 2014 



847 



Santpere etal. 



GBE 



length of the evaluated EBV genome was 135,486 bp out of 
-172 kb (-79% of the total). 

PCR Amplification, Cloning and Sanger Sequencing 

hDNA templates NA19114, NA19315, and NA19384 were 
purchased from Coriell Cell Repositories. Oligonucleotide 
primer pairs were EBNA-2Fw: 5'-ggatgcctggacacaagag-3' 
and EBNA-2Rev: 5'-tgtgctggtgctgctgg-3' # BBLF4Fw: 5'-agac- 
gatgcagggaatgc-3' and BBLF4Rev: 5 / -agaggcgctctcgtccac-3 / , 
LF2Fw: 5'-agccactgaggaagatctgg-3' and LF2Rev: 5'-gaagct- 
taccccggaggag-3'. The High Fidelity Taq DNA Polymerase 
KAPA HiFi (Cultek) was used and PCR amplification conditions 
comprised an initial denaturation of 3 min at 95 °C, followed 
by 35 cycles of 98 °C for 20 s, 60 °C for 1 5 s, and at 72 °C for 
1 5 s with a final extension at 72 °C for 5 min. The PCR prod- 
ucts were purified according to the QIAquick Gel Extraction kit 
instructions (Qiagen). The DNA fragments of interest were 
ligated into the pGEM-T easy vector at 4°C overnight accord- 
ing to manufacturer's instructions (pGEM-T Easy cloning kit, 
Promega) (in the case of the amplified product using EBNA-2 
primers with the NA1 91 14 template, DNA was digested with 
BspHI and the undigested fragment was selected for ligation). 
The products were transformed into competent DH5oc bacte- 
ria. Positive recombinants were selected on LB-ampicillin/IPTG/ 
X-gal plates and plasmid DNA was purified from overnight 
culture using QIAGEN Miniprep kit (Qiagen). The presence 
of the fragment of interest was verified by Rsal and EcoRI 
restriction digestion. Positive DNA plasmids were sequenced 
by Macrogen, Inc. 

Analysis and Representation 

Individual LCLs EBV genome sequences were obtained by in- 
corporating called SNVs into the reference EBV sequence. 
Most statistical analyses were performed using R packages. 
Manipulation of sequence intervals was carried out with 
IRanges (Pages et al.). Principal component analysis (PCA) 
was performed using the R package prcomp (R 
Development Core Team 201 2), on SNVs without scaling var- 
iables to avoid overcontribution of rare variants (McVean 
2009). Significance level of all principal components (PCs) 
was computed calculating the Tracy-Widom statistic to each 
PC eigenvalue (Johnstone 2001; Patterson et al. 2006). We 
estimated putative recombination breakpoints among EBV 
strains using Recco (Maydt and Lengauer 2006). Each recom- 
bination block was compared pairwise measuring genetic 
identity with the seqinr R package (Charif and Lobry 2007). 
Plots were performed using the ggplot2 package (Wickham 
2009) and Circos software (Connors et al. 2009). 

Phylogenetic trees were constructed using Phyml (Guindon 
et al. 2010) with the command line: phyml -I OUR_SEQs.phy 
-d nt -b 1000 -m K80 -v e -c 1 -a e -o n -s NNI with 1 ,000 
bootstrap. We performed 6N/6S analysis using model 0 in 
codeml from PAML (Yang 2007), which gives one co ratio 



for all lineages and considering the global tree resulting 
from the alignment of the whole EBV sequences. The NSites 
parameter, which sets variation between sites, was used to 
test one co ratio (NSites = 0) or different cos values between 
codons (NSites = 7, which allows co between 0 and 1; and 
NSites = 8, which allows co to be greater than 1). Different 
models were compared as usual in these cases. First NSites 
0 versus NSites 0 fixing co at 0.22 or 1 were compared with a 
log likelihood test with one degree of freedom. NSites 7 versus 
8 was compared with a log likelihood test with two degrees of 
freedom. Robustness of co estimates were assessed by produc- 
ing all possible topologies with eight final branches using 
PAUP* 4.0b10 (Swofford 2003) and reestimating co in each 
of them. 

Finally, alignment visualization tools included IGV viewer 
(Thorvaldsdottir et al. 2013) and Geneious Pro 5.6.3. 

Results 

Identification of Natural EBV Strains 

To evaluate the presence of natural EBV strains, we measured 
coverage in the region corresponding to the 12-kb B95-8 de- 
letion and compared it with the average coverage in the rest 
of the masked viral genome (127,219 bp). Overall EBV cover- 
age varied greatly among the analyzed individuals, ranging 
from Ox to 1 ,622.45 x. The mean and median coverage 
were 111.9x and 65. 9x, respectively. We required that 
each strain presented a minimum median coverage of 20x 
to be included in the analysis, which left a set of 784 individ- 
uals. The average median coverage for them was 123.3x, 
with 90% of the individuals presenting a coverage >20x in 
at least 99.9% of the genome (supplementary material S2, 
Supplementary Material online). 

Uniquely mapping reads (mapped only once) in the B95-8 
deletion indicate the presence of wild-type EBV. However, the 
B95-8 deletion contains two repeat regions (IR4 and DRright) 
that could generate false positives. Thus, coverage was calcu- 
lated only for the single-copy region immediately after these 
repeats (7kb between positions 144,445 and 151,554). The 
results suggested the presence of natural strains in some LCLs 
(fig. 1). The median coverage in the deletion region in the 784 
individuals was 0.002 x. A total of 61 LCLs showed no 
uniquely mapping reads at the deletion coupled with an over- 
all median EBV coverage of at least 20x, suggesting that they 
are exclusively infected by the transforming strain. Traces of 
wild-type EBV were detected in 71 1 LCLs, but in these cases 
mean coverage in the deletion region was <1x, not high 
enough for variant calling. Finally, 1 2 individuals (1 .53%) pre- 
sented a mean coverage greater than 1 x in the B95-8 dele- 
tion, with five of them reaching a coverage >10x. We set a 
threshold for mean deletion coverage plus one standard de- 
viation (2.1 x) to bring any individual to further analysis. This 
left a set of ten individuals with detectable natural EBV levels. 
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All ten naturally infected LCLs are shown in table 1 ordered 
according to the proportion of natural EBV coverage, as esti- 
mated by the ratio between average coverage present in the 
B95-8-specific deletion and overall EBV average coverage. 

Five out of the ten cell lines contained a proportion of the 
natural virus that is around or below the 5% allele balance 
cut-off that was used to call variants (see Materials and 
Methods). However, in the five LCLs with higher natural 
EBV load, the proportion of natural EBV ranged from -16% 
to -80%. These were four Kenyans (LWK, individuals 
NA19384, NA19380, NA19315, and NA19474) and one 
Yoruba (YRI, individual NA19114). For these individuals, we 
estimated the expected proportion of their EBV genomes cov- 
ered by natural EBV, assuming that coverage follows a Poisson 
distribution. The probability of having all EBV genome covered 
by at least two reads of the natural strain is given by 

P(X > 2) = (1 - e~ x - Xe- X Y 

where X is the average coverage per nucleotide at the B95-8 
deletion and k the total length of the viral genome in base 
pair. This simple calculation suggests that, with a 95-100% 
probability, all endogenous viral genome is covered by at least 



two natural EBV reads in individuals NA191 14, NA1931 5, and 
NA19384, with the figure dropping to 84.5% and 11.5%, 
respectively, for individuals NA19380 and NA19474. In con- 
clusion, natural EBV can be detected in at least three cell lines 
with reliability high enough for variant calling. Of course, this is 
a probabilistic approach, and even if probabilities are quite 
high it is not possible to have absolute guarantee that there 
are no regions with no coverage from the natural genome. 

Measuring Nucleotide Diversity 

All evidence indicates that variability levels in the transforming 
strain B95-8 are extremely low. Only 279 positions were pu- 
tatively polymorphic among the 61 LCLs, which were infected 
solely with the transforming strain. Out of these 279 variants, 
only 7 showed an allele balance for the alternative allele 
greater than 0.8, suggesting that they are sequencing arti- 
facts. Moreover, all 279 variants, but one, were present in a 
single individual. Five individuals share this common variant, 
which is found at position 70,269 in a region with no anno- 
tated functional elements. In summary, each of these 61 in- 
dividuals is infected by a B95-8 strain which is nearly identical 
to the reference. Interestingly, this allows for an estimation of 




Deletion Whole EBV 

genome 



Fig. 1 . — Boxplot of the coverage for all LCLs in the whole EBV genome (right) and in the B95-8-specif ic deletion (left). The inner panel displays a zoom in 
the coverage scale that shows that coverage at B95-8 is nonzero, suggesting the presence of natural EBV in some LCLs. 
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Table 1 

Basic Statistics for the Ten Naturally Infected Individuals 



IND 


Origin 


Mean EBV 


Mean Coverage 


% of 


No. of 


Mean Variants 


Ts/Tv 








in Deletion 


Natural EBV 


Variants 


AB (%) 




NA20783 


TSI 


277.67 


5.91 


2.13 


11 


0.34 


1.2 


NA18507 


YRI 


115.22 


3.08 


2.67 


27 


0.06 


1.455 


NA20348 


ASW 


75.53 


3.30 


4.36 


121 


0.07 


1.814 


NA18923 


YRI 


68.24 


3.18 


4.66 


77 


0.07 


2.208 


NA20524 


TSI 


119.84 


6.21 


5.18 


170 


0.06 


1.698 


NA19114 


YRI 


246.15 


38.38 


15.59 


445 


0.17 


1.967 


NA19474 


LWK 


56.91 


13.67 


24.03 


602 


0.20 


1.894 


NA19315 


LWK 


62.84 


18.04 


28.70 


628 


0.25 


1.881 


NA19380 


LWK 


38.44 


16.39 


42.65 


556 


0.35 


1.78 


NA19384 


LWK 


36.16 


29.56 


81.75 


576 


0.70 


1.73 



Note. — Shadowed rows indicated individuals with a higher endogenous EBV proportion. AB, Allele balance; TSI, Toscani in Italia; ASW, 
Americans of African ancestry in SW USA; YRI, Yoruba in Ibadan, Nigeria; LWK, Luhya in Webuye, Kenya. 
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Fig. 2. — Scatter plot where each dot indicates the number of variants 
called in each LCL. Light blue dots represent LCLs with strong evidence of 
the presence of natural EBV strains (from ten different individuals). The size 
of the dot indicates the mean allele balance found in each LCL. Five LCLs 
clearly show a large amount of genetic variation, as they harbor the great- 
est proportion of natural EBV. 

the amount of artifactual variants expected in individuals that 
carry the natural strain: given that each one of these 61 LCLs 
presents five low-quality variants on average, a similar number 
of false positives should be expected in individuals that also 
carry variants from the endogenous, wild-type EBV. 

In sharp contrast to the lines infected solely with the trans- 
forming strain, diversity levels were much higher in all the LCLs 
showing high coverage at the B95-8 deletion. Allele balances 
were also higher, indicating that these variants had been con- 
fidently called (table 1 and fig. 2). In total, we observed 1 ,534 
single nucleotide variants (SNVs) in the 1 0 LCLs with the high- 
est coverage at B95-8 deletion. As expected if the 



endogenous virus contributed them, these are high-quality 
variants and their allele balance is highly correlated with the 
proportion of natural EBV reads over total EBV coverage 
(/? 2 = 0.79, P= 0.00056, see table 1). 

Further evidence that the nucleotide variants detected in 
the 10 LCLs with coverage in the B95-8 deletion are real 
comes from their transition to transversion ratios (Ti/Tv), see 
table 1. The Ti/Tv ratio for variants in the 61 LCLs infected 
solely with the B95-8 strain is quite low (~1), indicating that 
these variants are mostly sequencing errors. In the 10 individ- 
uals with evidence of natural coinfection, the Ti/Tv ratio is 
around 1 .8 (1 .77), closer to the ratio of 2, which is expected 
in nature. For NA1 91 1 4, NA1 93 1 5, and NA1 9384 individuals, 
with the highest proportion of endogenous EBV, the Ti/Tv 
ratio in their 1,099 variants is even higher (1.86). Finally, we 
observed that not a single variant is shared between the two 
groups of LCLs: the 61 LCLs infected only with B95-8 and the 
10 LCLs coinfected with endogenous EBV. The number of 
variants called and the Ti/Tv ratios for each subset of individ- 
uals can be found in supplementary table S1.2, 
Supplementary Material online. In summary, all these obser- 
vations, together with the practical absence of variability in the 
B95-8 strain, indicate that rather than being mutations ac- 
quired in vitro, the vast majority of the 1,535 variants are 
due to the endogenous EBV virus which originally infected 
the individuals sampled within the 1000 Genomes Project. 
Therefore, with the exception of the RepeatMasked regions, 
we can reconstruct the full EBV genome from these three 
strains. Supplementary material S3, Supplementary Material 
online, contains a detailed explanation about the effects of 
applying different cut-offs on the accuracy of our variant call- 
ing in our three LCLs. 

Experimental Validation of the Presence of Natural 
EBV Sequences 

To prove experimentally the coexistence of different strains 
in these LCLs and to validate the variants that we had 
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Fig. 3. — Genome Browser snapshot of the reads from the three individuals from which we obtained high-quality EBV genomes aligned against EBV type 
1 (top panel) and type 2 (bottom panel) reference genome. The snapshot shows the EBNA-3reg\on, which was used to discriminate EBV type 1 from type 2. 
Large regions with no coverage are observed in the type 2 mappings, demonstrating that all three cell lines are infected only by type 1 EBV strains. 



determined, the DNA of the three LCLs for which we had 
performed high-quality variant calling was obtained from 
the Coriell Institute. Primers were designed to PCR amplify 
fragments of the EBNA-2, LF2, and BBLF4 genes, spanning 
regions containing called variants. By cloning and sequencing 



we obtained and analyzed sequences from these individuals 
(supplementary fig. S4.1, Supplementary Material online). 
All the sequences we predict for the natural strain in these 
individuals are always found and, thus, they are confirmed. 
Moreover, all of the variants that we had previously identified 
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in silico for NA19315, NA19384, and NA191 14 were present 
and, crucially, we did not find any new unpredicted variants at 
a level of confidence that would allow for their calling, which 
vouches for the completeness of our in silico analysis. 
Additionally, the transforming strain sequence was detected 
in fragments of the EBNA-2 and BBLF4 genes from all individ- 
uals, and it was invariably identical to the B95-8 reference 
sequence. As expected, LF2, a gene located in the B95-8 
deletion, only presented natural sequences in all three LCLs. 
These results confirm both the presence of natural viruses and 
the high accuracy of our variant calling for individuals 
NA19315, NA19114, and NA19834. 

Classification of All the Available EBV Strains According 
to Geography 

We started by classifying the three genomes we had obtained 
according to current criteria. To do so, we used a common 
method of distinguishing EBV types that is based on the fact 
that nucleotide differences in the EBNA-2 and EBNA-3 genes 
are much higher between than within types, with large re- 
gions being exclusive of one or other EBV type. Mapping EBV 
reads from the diagnostic region EBNA-3A-3B-3C of the three 
called individuals against the EBV type 2 reference genome 
(fig. 3) shows no uniquely mapping reads in the large regions 
exclusive of the EBV type 2 version of EBNA-3 genes, indicat- 
ing that all natural EBV genomes present in these individuals 
are type 1 . Many substrains of EBV type 1 have been defined 
by polymorphisms at genes such as EBNA-1 , EBNA-2, LMP-1 , 
LMP-2a, or BZLF1 . From such studies, some particular substi- 
tutions are found predominantly in different regions of the 



world. All the polymorphism from the three strains reported 
here are consistent with the known African origin of the 
donor individuals. (For a detailed analysis of polymorphisms 
in these four genes, see supplementary material S5, 
Supplementary Material online.) 

To take advantage from full-genome information in 
classifying EBV genomes, we performed a PCA including the 
reference sequence plus all fully sequenced EBV strains that 
are available in literature and presented low counts of ambig- 
uous nucleotides (Lin et al. 2013): AKATA, GD1, MUTU, and 
AG876 (fig. 4 and supplementary table S1 .1, Supplementary 
Material online). The first PC explains around 37% of variance 
(P= 0.004) and separates type 1 from type 2 strains. The 
second component explains around 27% of the variance 
(P= 0.014) and separates Asian strains from African and ref- 
erence strains. 

Phylogenetic Relationships between EBV Strains 

Constructing a phylogenetic tree with the eight high-quality 
EBV sequences (three from the present study and five previ- 
ously available best-quality sequences from diseased individ- 
uals) confirms and expands the PCA results presented earlier. 
It is immediately obvious from figure 5a that the AG876 
branch is longer than any other, indicating greater divergence, 
that NA19315 shares a common ancestor with MUTU, and 
that all African plus B95-8/Raji strains are clearly separated 
from the two Asian strains (AKATA and GD1). We repeated 
the phylogenetic tree using only the 12 kb in the B95-8 dele- 
tion (fig. 5b). Asian strains still appear separated from the rest, 
NA19315 and MUTU remain in the same clade, but the 
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classification of NA19384, NA19114, AG876, and Raji frag- 
ment is unclear, probably due to lack of statistical power. 
However, NA19114 and the Raji fragment still group to- 
gether. It is noteworthy that the AG876 branch is now 
much shorter and becomes integrated with most African 
strains. 

Finally, we reconstructed the tree excluding latency genes, 
where most of intertypic variation is found (Dolan et al. 2006), 
and observed (fig. 5c) that the AG876 branch is again shorter 
and all African strains are clearly separated from Asians. 



However, the tree from figure 5b shows that NA19384, 
from Kenya, shares a most recent common ancestor with 
Asian strains. These findings constitute clear proof that differ- 
ent genomic regions present different genealogies. The most 
likely cause is, of course, intertypic and intratypic recombina- 
tion events. 

Recombination between EBV Genomes 

To try reconstructing the recombinational history of all the 
available EBV genome sequences, we studied levels of identity 
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amongst sequences. We used Recco (Maydt and Lengauer 
2006) to infer putative recombination breakpoints considering 
each of our three genomes (one at a time) as the recombinant 
product of the five good quality strains published so far. Then, 
we determined "haplotype" blocks as pieces of sequence be- 
tween any two breakpoints identified in any EBV genome. 
Finally, we calculated an identity distance matrix for each 
one of these blocks. Figure 6a and b shows that the EBV 
genomes from NA19114 and NA19384 share haplotype 
blocks mainly with B95-8/Raji and MUTU strains, with fewer 
blocks closer to the Asian strains (GD1 and AKATA); while 
NA19315 shows almost 100% block identity with MUTU 
(fig. 6c). The same analysis performed on AG876 indicates 
that its EBNA-2 and EBNA-3 blocks are divided among all 
other genomes and show very low sequence identity 



(supplementary fig. S6.1f, Supplementary Material online). 
The rest of the EBV type-2 genome is divided in haplotype 
blocks distributed evenly among the Asian and B95-8/Raji 
strains. The genome of the B95-8/Raji strain appears as a com- 
posite of MUTU and AG876 haplotypes plus a few AKATA 
and GD1 blocks (supplementary fig. S6.1a, Supplementary 
Material online). AKATA and GD1 strains can be reciprocally 
explained (supplementary fig. S6.1c/ and e, Supplementary 
Material online). 

The Genomic Distribution of EBV Nucleotide Variants 

As usual, different classes of genes present varying levels of 
genetic diversity. We calculated Watterson's estimate of 0, the 
parameter of neutral molecular evolution (Watterson 1975) 
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Fig. 7. — Circos plot showing genome-wide single-nucleotide variant diversity in the three LCLs with the highest natural EBV load. Each green circle 
corresponds to one individual. Red dots indicate variable positions. Blue dots indicate variants present in all three EBV sequences. The innermost circle shows 
diversity on 500-bp windows considering all variants together. In the outer circle, latency genes (red) and lytic genes (blue) are represented. Light gray 
shadows indicate repetitive regions, and dark gray sector indicates the location of the B95-8-specific deletion. 



using SNVs called in all coding regions and separately for 
latency and lytic genes, as well as for UTRs and introns 
(supplementary table S1.3, Supplementary Material online, 
and fig. 7). Latency genes harbor more diversity than any 
other element in the genome and they double the amount 
of diversity found in the rest of coding elements. This trend 



may be indicative of very different evolutionary pressures in 
the two classes of genes. 

As for genomic elements, they also present different levels 
of variation (supplementary tables S1.3 and S1.4, 
Supplementary Material online, and a more detailed descrip- 
tion in supplementary material S5, Supplementary Material 
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online). Focusing on gene-disrupting mutations, we found a 
mutation introducing a premature stop codon in the RPMS1 
gene of the NA19315 sequence (nucleotide C160T, residue 
Q54X). Although the RPSM1 ORF can be translated in vitro, 
the RPMS1 protein has never been detected in natural infec- 
tions (Al-Mozaini et al. 2009). Finally, two different frameshift 
deletions were found in the LF3 coding region for NA19315 
and NA19384. Both mutations create a new frame that pro- 
duces a slightly longer version of LF3 version (supplementary 
material S5, Supplementary Material online). It has been 
shown that the LF3 coding frame is not conserved in 
AKATA, and many start and stop codons in the 5' of the 
gene in both AKATA and MUTU suggest a primary noncoding 
function for LF3 (Lin et al. 2013). 

Rates of Protein Evolution 

Given that different genes present different diversity levels, it 
makes sense to evaluate whether they are evolving at different 
rates. The most widely used method to study rates of protein 
evolution is to estimate the ratio between dN, the rate of 
nonsynonymous substitutions along a branch of the phyloge- 
netic tree, and the corresponding dS, the rate of synonymous 
substitutions. Using the multiple sequence alignments of our 
three EBV sequences coming from healthy individuals plus 
the five known high-quality EBV strains, we calculated co 
(the dN/dS ratio), for all coding regions by means of PAML's 
codeml (Yang 2007) and using the global tree. Supplementary 
table S1 .4, Supplementary Material online, shows the average 
co value for all genes and for two different groups of genes: 
those expressed during the lytic phase and those expressed 
during latency. The average co value for lytic genes (0.22) in- 
dicates that most EBV genes are under similar constraints than 
average human genes (-0.25) (Kosiol et al. 2008). In contrast, 
co is significantly increased in latency genes, with an average of 
0.84 (Wilcoxon test P value 0.000015) (supplementary table 
S1.4, Supplementary Material online). 

We also studied dN/dS at the individual gene level in order 
to identify particular genes under strong purifying or positive 
selection. In general terms, genes that are very conserved at 
the protein sequence level tend to be considered under strong 
purifying selection, while, in contrast, genes with relatively 
larger rates of nonsynonymous than synonymous changes 
may have achieved that through the effect of adaptive evolu- 
tion). We had estimated co values for each gene and obtained 
likelihoods for every co value under the M0 model, considering 
one co ratio for the whole tree. We then fixed co at 0.22 (the 
average co value in lytic genes) obtaining a second likelihood 
value, always under Model 0. Finally, we performed likelihood 
ratio tests (LRTs) to determine what were the genes whose co 
values were significantly different from 0.22. We performed 
multiple test correction with a FDR g-value cut-off of 5% 
(supplementary material S7, Supplementary Material online). 



A total of 1 1 genes are under strong purifying selection, as 
indicated by their significantly lower co values. Five of them are 
related to DNA replication and nucleotide metabolism, includ- 
ing BALF5 (the catalytic subunit of the viral DNA pol), BALF2 
(part of the replication fork machinery), BBLF4 (helicase), 
BGLF4 (kinase involved in many functions required for efficient 
lytic DNA replication), and BORF2 (a ribonucleosidereductase). 
One of the most conserved genes, BcLF1 , encodes the major 
capsid protein (MCP), which self-assembles in pentamers and 
hexamers to form the virus capsid. Three other very 
constrained genes are BALF3, BGRF1 , and BBRF1, which are 
involved in DNA packaging inside the capsid during the lytic 
phase. Interestingly, the other two genes are related to 
tegument and cell attachment (BSRF1) and/or interaction 
with host proteins (LF2, which interacts with IRF7). 

We then repeated the codeml Model 0 analysis fixing co to 
1 (the expected value of co under neutrality) and compared 
with the same model with free co. Only EBNA-1, which 
encodes for essential proteins for the establishment and main- 
tenance of EBV latency, presented a co value significantly 
greater than one (co=1.78). This constitutes evidence that 
the EBNA-1 gene has been under positive selection. 

We found six nonlatency genes to be accelerated relative to 
average co values found in lytic genes. BRRF2 and BFRF2 
encode for two viral tegument proteins with unknown func- 
tions. BRRF2 is known to be expressed and translated because 
peptides BRRF2 were abundantly detected in LCLs (Johannsen 
et al. 2004). BFRF2 has only clear evidence at transcript level 
(Concha et al. 2012). BARF0 has evidence at transcript level 
but may not be translated in vivo. Finally and more interest- 
ingly, we found BKRF2, BLLF1 , and BLLF2 to be accelerated. 
BKRF2 encodes for the envelope glycoprotein L required for 
the fusion of viral and plasma membranes for a virus to be able 
to enter into the host cell and interact with the host's integrins 
(Connolly et al. 2011). BLLF2 ORF comprised BLLF1 coding 
region. BLLF1 encodes for gp350, which initiates the virus 
attachment to the host's cells via interaction with B cells 
CR2 (Young et al. 2008). 

One potential problem can arise from forcing a single tree 
topology produced by whole-genome alignment to genes 
that have different genealogies, as we did in the calculations 
above, because assuming the wrong tree may lead to mises- 
timation of co. To control for that source of error, we estimated 
co under all possible tree topologies produced by eight 
sequences (supplementary material S8, Supplementary 
Material online). In general, co estimates for latency genes 
are robust to different tree topologies. In particular, co is 
significantly higher than 1 (P< 0.05) in 88.3% of the possible 
tree topologies of EBNA-1. Similarly robust co values are 
observed also in lytic genes BLLF1 and BRRF2. However, co 
estimates for BLLF2, BFRF2, and BKRF2 seem to be more 
affected by tree topology, making these measures less reliable 
(supplementary material S8, Supplementary Material online). 
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It is possible that, even if a gene as a whole is under strong 
purifying selection, some of its codons or groups of codons 
have been the targets of positive selection. In order to test 
specifically for this hypothesis while controlling for contribu- 
tion of a relaxation of selective constraints, we performed 
another codeml analysis that compares two different models 
that allow variation of co values between codons. The first 
model (M8) allows a subset of codons to have co greater 
than one. That model is usually tested against a null model 
that does not permit any codons to have co greater than 1 
(M7). Even when using a 5% q-value cut-off, the LTRs showed 
that M8 explained our data better than M7 in eight of the 
latency genes: EBNA-1, -3 A, SB, -3C, -LP, LMP-1, and 
LMP-2A, which seem to have undergone accelerated evolu- 
tion under the influence of positive selection. BLLF1, BLLF2, 
BARFO, BFRF2, and BRRF2 but not BKRF2 also show this sig- 
nature at codon level. In addition, 20 other genes whose co 
values are not particularly high and, thus, did not show up in 
the previous analyses also bear the signature of adaptive evo- 
lution (supplementary material S7, Supplementary Material 
online). Two examples are BNRF1, which encodes the major 
tegument protein (MTP) involved in virus entry, attachment, 
and membrane fusion, and BVRF2, encoding one capsid 
protein and BFRF1A and BFLF1, which are involved in packag- 
ing; the later being also responsible for placing capsids in the 
nuclear replication compartments. Finally, we detected 
signatures of positive selection at some codons in genes 
whose complete sequence appears particularly constrained 
such as BALF2 and BALF3, indicating putative functional sites. 

Discussion 

We have carried out the most complete comparative geno- 
mics study of EBV so far. To do so, we combined published 
EBV genomes from diseased donors with our own survey of 
sequence data from the 1kG Project, which provided EBV 
genomes from healthy individuals. In total, we deeply ana- 
lyzed eight EBV genome sequences. 

One major obstacle that we had to overcome to carry out 
our study was the presence of the B95-8 transforming strain 
in most of the 1 kG Project LCLs (-92%). However, we could 
obtain clear evidence of co-occurring natural EBV strains in 1 0 
out of 946 individuals. Given the high prevalence of EBV in- 
fection, the number of individuals in which we observe co- 
occurrence of natural EBV may seem surprisingly low. To 
understand that apparent paradox, we need to consider that 
in healthy individuals (such as those from the 1 kG Project), EBV 
persists in peripheral blood B cells at a frequency of about 1-50 
infected cells per million B cells (Babcock et al. 1998). Given 
that around 2 million B cells are contained in 1 ml of blood, we 
would expect to find low counts of infected lymphocytes per 
blood extraction. Moreover, only -3% of the peripheral leu- 
kocytes of adult donors become immortalized by EBV (Sugden 
and Mark 1977). This clearly reduces the chances of a B95-8 



transformation of an already naturally infected B cell. In addi- 
tion, although initially EBV-immortalized B-cell lines are poly- 
clonal, after prolonged culture in vitro, cell lines become 
oligoclonal or monoclonal (Ryan et al. 2006). Except for a 
small subset of immune-compromised individuals, most EBV- 
associated tumors and LCLs contain monoclonal EBV (Gulley 
et al. 1994). This monoclonality may be indicative either of 
genetic drift or of natural selection among cell clones. At any 
rate, the most common scenario is that a few clones quickly 
become dominant (Ryan et al. 2006), reducing the probability 
of finding clones infected with the natural EBV relative to 
clones infected only with the B95-8 strain. 

Nevertheless, it is likely that by increasing sequencing cov- 
erage in 1kG Project samples, it would be possible to detect 
more LCLs with some wild-type EBV. Actually, a recent study 
provides analyses of publicly available RNA-seq data from 319 
LCLs derived from 143 individuals (Arvey et al. 2012), some of 
them overlapping our set of samples. By looking at their pub- 
lished RPKM values (Reads per Kilo Base per Million) of the LF3 
gene, which is located in the deletion of the B95-8 transform- 
ing strain, we can deduce that at least 76 LCLs (-50%) con- 
tain some load of natural EBV strains. 

Infections by multiple EBV strains are most commonly ob- 
served in immunocompromised individuals (Tierney et al. 
2006), but also happen in healthy individuals (Lung et al. 
1988; Walling et al. 2003). Regarding only the two major 
EBV types, rates of coinfection of EBV type 1 and type 2 
range from 0% to 53% (Kunimoto et al. 1992; Srivastava 
et al. 2000; Walling et al. 2003). High frequency of coinfection 
with both type 1 and type 2 EBV has been reported in MS 
patients (Santon et al. 201 1). Given that we do not find evi- 
dence of coinfection by different natural EBV types, it is worth 
noting that we might have underestimated intraindividual EBV 
diversity if our variants came only from a dominant isolate or 
the LCLs that were sequenced within the 1 kG Project were the 
result of a selection bias toward transformation-competent 
EBV isolates (Tierney et al. 2006). By amplifying and cloning 
fragments of EBV of our three LCLs with strong evidence of 
natural coinfection, we have been able to prove the existence 
and accuracy of our predicted natural strains, as well as the 
existence of major dominant isolates of natural EBV coexisting 
in these studied LCLs. 

Are the Variants Determined Really Natural? 

Several facts support that the variants we have studied were 
wild-type variants present in the B cells of 1 kG Project individ- 
uals before the establishment of LCLs. The strongest piece of 
evidence is that LCLs with no sign of coinfection by natural 
EBV contain very low levels of polymorphism, which given 
their low Ti/Tv ratio and extreme values of allele balance are 
likely to be sequencing errors. In this subset of 61 LCLs, we 
observed an average of only five low-quality variants per LCL, 
so it is unlikely that B95-8 variants are biasing our results. In 
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fact, the opposite is more likely: we cannot rule out that traces 
of natural EBV could account for some of the variants that we 
attribute to B95-8. Cloning and Sanger sequencing of frag- 
ments of EBV provide conclusive evidence that the B98-5 ver- 
sion of these fragments is identical to the published reference 
sequence. In short, we have determined that the B95-8 strain 
possesses a very stable genome that does not present any 
relevant contribution to the pool of variants that we deter- 
mined in our three natural strains. The same experiment also 
validated the confidence of our variant calling in the evaluated 
EBV fragments. However, an inherent limitation in our data 
remains. Given the contribution of reads of both natural and 
transforming strain, however the small the later may be, it is 
not possible to guarantee the total absence of false-negative 
variant calls. As explained in supplementary material S3, 
Supplementary Material online, this is particularly true for 
NA19315, but we estimate that even in the worst case sce- 
nario in which all very low frequency variants were true pos- 
itives, in NA1931 5 we would still have a -94% of sensitivity. 

Phylogeny and Recombination 

Phylogeny is not always consistent in different regions of the 
EBV genome, allowing the inference of recombination events, 
as previously reported by McGeoch and Gatherer (2007). By 
studying genomic blocks delimited by predicted recombina- 
tion breakpoints, we observed that current haplotypes are 
mixed inter- and intratypically. EBV type2 (AG876) is in fact 
a composite of high-identity blocks already present in all 
known type 1 strains. It has been proposed that the EBV 
type 2 strains come from a type 1 prototype strain that incor- 
porated EBNA-2 and EBNA-3 blocks from a closely related 
lymphocryptovirus (McGeoch 2001). Given the crucial role 
of these genes in latency establishment, it has been hypothe- 
sized that such acquisitions increased the virus' fitness as 
EBNA-2 and EBNA-3 blocks from type 1 and type 2 were 
maintained in the present combinations relative to a possible 
recombinant version between type 1 and type 2 (McGeoch 
2001 ). Another possibility would be that intensive substitution 
processes in those genes were driven by positive selection 
(McGeoch 2001). Having said that, inter-typic recombinants 
showing recombination breakpoints in the EBNA-3 gene block 
have been reported in healthy and immunecompromised in- 
dividuals from different geographical regions (Midgley et al. 
2000, 2003; Gorzer et al. 2006). During productive EBV rep- 
lication, intrastrain homologous and nonhomologous recom- 
bination events can occur and have been observed in 
repetitive regions of the genome (Walling et al. 1992; 
Walling and Raab-Traub 1994; Walling et al. 1999). Thus, 
our results warrant caution in classifying EBV strains: recom- 
bination events may confound relatively simple classifications 
based on single-gene polymorphisms such as those found in 
the EBNA-1 gene. 



Rates of Protein Evolution in EBV Genes 

We analyzed SNV and INDEL polymorphism in our three 
strains in relation to the B95-8/Raji reference genome and 
observed that diversity found in latency related genes is the 
highest compared with any other genomic element. EBV 
genome comparison between simian homologs of EBV had 
revealed that some latency genes diverged faster (Wang et al. 
2001). Confirming and expanding these results, we found 
that diversity in latency genes doubles the level of nucleotide 
variation found in lytic genes and is even higher than diversity 
found in introns. We calculated co values, studied the signa- 
tures of selection for all coding regions in EBV, and compared 
different selection models using a LRT. We acknowledge the 
well-known fact that accuracy and power of the LRT for 
detecting selection on genes depend on the number of 
sequences, with power being 100% at -15 sequences 
(Anisimova et al. 2001). By adding our three new sequences 
to five already reported EBV genomes, we increase the power 
to detect positive selection acting on genes; however, we still 
might be underpowered for some low-divergence genes. In 
particular, the limitations of site-prediction methods are 
greater when assessing protein evolution rates at the codon 
level (Nozawa et al. 2009) and thus it is clear that experimental 
validation of putatively selected residues would be needed. 

Our analysis provides evidence for the action of both puri- 
fying and positive selection. The average co value of lytic genes 
resembles the average of -0.25 found in primates (Kosiol 
et al. 2008), indicating similar evolutionary constraints in 
viral and host genes. We reported evidence for strong purify- 
ing selection acting upon 1 1 EBV genes. These include genes 
encoding proteins that work in crucial enzymatic activities 
such as the viral DNA polymerase (BALF5) or the helicase func- 
tion (BBLF4). One host-interacting element has also been the 
target of purifying selection: LF2. LF2 binds the central inhib- 
itory association domain of IRF7, inhibiting the interferon-me- 
diated immunity, which is the major defense raised by the host 
against viral infections (Wu et al. 2009). 

Latency genes, in contrast, present higher co values, indicat- 
ing that positive selection may be the cause of their high di- 
versity levels. Using various models we observed that EBNA-1, 
-3A, -3B, -3C, -LP, LMP-1 , and LMP-2A contain a subset of 
codons, which have experienced positive selection. The action 
of positive selection inferred from an increased proportion of 
nonsynonymous variation has already been described in LMP-1 
and EBNA-1 (McGeoch and Davison 1999; Walling et al. 
1 999). In addition to these genes, other coding regions related 
to virus attachment and entry in host cells {BLLF1 and BKRF2) 
show signatures of accelerated protein evolution rates. BRRF2 
and BFRF2 present a higher co value and, together with BNRF1 
and three other capsid and packaging-related genes, also 
show evidence of positive selection at codon level. BNRF1 en- 
codes for the MTP. The BRRF2 protein has been abundantly 
detected in LCLs by means of mass spectrometry (Johannsen 
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et al. 2004), and it is likely that both constitute viral tegument 
proteins. As these kinds of proteins can regulate viral attach- 
ment and immune evasion of viruses, evidence of selection in 
BLLF1, BKRF2, MTP, BRRF2, and BFRF2 suggests that knowl- 
edge may be gained from a detailed analysis of genetic varia- 
tion in these genes in different populations and pathology 
groups and from the study of functional implications of 
these variants in the biology and pathogenesis of EBV. 

Supplementary Material 

Supplementary materials S1-S9 are available at Genome 
Biology and Evolution online (http://www.gbe.oxfordjour- 
nals.org/). 
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